Abstract

Background: Pattern mining for biological sequences is an important problem in bioinformatics and
computational biology. Biological data mining has an impact on diverse biological fields, such as the discovery of
co-occurring biosequences, which is important for biological data analyses. Approaches to mining sequential
patterns can discover motifs of all lengths in biological sequences. Nevertheless, traditional approaches mine
DNA and protein data inefficiently because the data have small alphabets and long sequences. Furthermore,
gap constraints are important in computational biology since they cope with irrelevant regions, which are not
conserved in the evolution of biological sequences.

Results: We devise an approach to efficiently mine sequential patterns (motifs) with gap constraints in biological 
sequences. The approach is the Depth-First Spelling algorithm for mining sequential patterns of biological 
sequences with Gap constraints (termed DFSG). 

Conclusions: PrefixSpan is one of the most efficient traditional approaches to mining sequential patterns, and
it is the basis of GenPrefixSpan, which extends PrefixSpan with gap constraints. We therefore compare DFSG
with GenPrefixSpan. In the experimental results, DFSG mines biological sequences much faster than
GenPrefixSpan.



Background 

Pattern mining has numerous applications, such as pur- 
chasing pattern mining, biological pattern mining, and 
Web log pattern mining. Therefore the academic com- 
munity has devised useful methods to mine patterns, e.g., 
mining traditional sequential patterns [1-4], maximal 
sequential patterns [5], closed sequential patterns [6], 
sequential patterns of data streams [7], incremental 
sequential patterns [8], and progressive sequential pat- 
terns [9]. Traditional sequential pattern mining methods 
discover general sequential patterns, which can be 
applied to various constraints. From a technical point of
view, the methods of mining traditional sequential
patterns fall into two well-known types: apriori-based
methods [1,2] and projection-based pattern growth
algorithms [3,4]. The apriori-based methods combine
items into candidate patterns and then validate the
candidates. The projection-based pattern growth
algorithms scan all sequences and project patterns
recursively. The data formats of traditional methods are
divided into horizontal data formats and vertical data
formats.

Traditional sequential pattern mining methods discover
2^l subsequences of a sequential pattern with length l.
The number of subsequences of a sequential pattern is
too large in traditional mining methods, and therefore
the maximal sequential pattern mining method [5] was
proposed to efficiently identify maximal sequential
patterns, which have no frequent supersequences. Another
alternative is to mine closed sequential patterns [6],
which are patterns that have no frequent supersequences
with the same occurrence frequency. The closed
sequential patterns not only largely reduce the number of
reported sequential patterns, but also preserve the
expressive power of traditional mining algorithms since
the subsequences of a closed sequential pattern are easily
derived. Mining sequential patterns of data streams [7]
takes place in a different environment and has some
additional constraints, such as strictly restricted memory,
continuously identified sequential patterns, and linear
execution time.

Incremental databases are formed with newly added 
sequences. The incremental sequential pattern mining 
algorithm [8] is devised to efficiently mine incremental 
databases since many real data grow incrementally. Most 
users are usually interested in recent data, and therefore 
the progressive sequential pattern mining algorithm [9] 
generates sequential patterns in a period of interest. The 
method can find newly arriving sequential patterns and 
discard obsolete sequential patterns. Data mining
technology has also been used in the bioinformatics
domain. For example, temporal pattern mining techniques
are used to mine predictive and non-spurious patterns
[10], and associated functional subgraphs are discovered
by a pattern mining method [11] in cancer protein-protein
interaction networks. It is increasingly important to
develop approaches for efficient biological data mining
since biological sequences are now in widespread use in
the field of bioinformatics.

Some important research directions for data mining in
bioinformatics are the discovery of co-occurring
biological sequences, the effective classification of
biological sequences, and the clustering of biological
sequences [12-14]. In molecular biology, motifs are
functionally significant and have specific structures, and
they are mined from unaligned biological sequences.
Mining sequential patterns (motifs) helps to identify
co-occurring biological sequences and to discover
relationships in DNA or protein data [15,16]. In the
bioinformatics domain, mining sequential patterns
(motifs) has proven useful for tasks such as the
classification of biological sequences, the prediction of
transcription factor binding sites, the recognition of
protein folds, and the identification of hot regions in
protein-protein interactions.

The problems of mining sequential patterns correlate
closely with some traditional problems of computational
biology [17-20], such as motif finding and sequence
alignment. In the field of biology, biological sequences
conserve sequential patterns over long periods of
evolution, and these patterns may correspond to critical
functions. The 2PDF approach [21] was the first proposed
to mine sequential patterns of biological sequences, but
it generated a large number of patterns and did not cope
with gap constraints. DFSP [22] is a general model of
mining sequential patterns for biological sequences, but
it did not cope with gap constraints either. The gap
constraints of the TEIRESIAS algorithm [23] and SPLASH
[24] are rigid. The TEIRESIAS algorithm has two phases:
scanning and convolution. In the scanning phase,
TEIRESIAS generates all <L, W> patterns that have a
support of at least k. L is the minimum number of
residues, and W is the length of patterns. In the
convolution phase, TEIRESIAS constructs maximal patterns
from the <L, W> patterns. SPLASH is another algorithm
with rigid gap constraints. SPLASH first builds a seed
set and then extends patterns recursively. All the final
patterns satisfy the density constraint, which requires
that every substring of length l_0 in a pattern has at
least k_0 full characters.

Next, we briefly introduce the 2PDF method, which
generates a new and different type of pattern. The
patterns have the form "P_1*P_2*...*P_k*...*P_{n-1}*P_n."
Each "P_i" denotes a frequent segment, in contrast to the
complete set of patterns in traditional sequential pattern
mining problems. A frequent segment is a segment that is
longer than MinLen (minimum segment length). A single
symbol "*" represents items or gaps of arbitrary length.
The method extracts segments from all sequences with a
generalized suffix tree, and the segment tree (composed
of the segments) is then used to generate the pattern
tree. The 2PDF method mines the complete set of
sequential patterns only when MinLen = 1. Here the
complete set of sequential patterns means the complete
set of all-length sequential patterns; for example, the
complete set of length-1 sequential patterns in DNA
sequences may be {<A>, <T>, <C>, <G>}. When MinLen = 1,
the segment tree in the 2PDF method is too large.
Moreover, the pattern tree is generated by a combinatorial
method, so too many patterns (all combinations of the "*"
positions) are produced. For example, the 2PDF method may
generate the patterns "abc*d," "ab*cd," "a*bcd," "ab*c*d,"
"a*bc*d," "a*b*cd," and "a*b*c*d" where DFSG [25] or
GenPrefixSpan [26] generates only the pattern "a*b*c*d"
(without limitation of gap constraints). As this example
shows, the 2PDF method mines too many patterns for
biological sequences.
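
The combinatorial growth in the example above can be reproduced with a short Python sketch. It is only an illustration of the counting argument, not the 2PDF implementation; the function name `star_variants` is hypothetical. With n items there are n - 1 slots between adjacent items, each of which may or may not hold a "*", so 2^(n-1) - 1 patterns contain at least one "*".

```python
from itertools import product

def star_variants(items):
    """Return every pattern that places '*' in at least one slot between items."""
    slots = len(items) - 1
    variants = []
    for mask in product(("", "*"), repeat=slots):
        if not any(mask):
            continue  # skip the string without any "*"
        pattern = items[0] + "".join(sep + item for sep, item in zip(mask, items[1:]))
        variants.append(pattern)
    return variants

variants = star_variants("abcd")
print(len(variants), sorted(variants))
```

For "abcd" the sketch yields the seven patterns listed in the text.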

The traditional algorithms of mining sequential patterns
[1-4] cope with a large number of items and short
sequences. DNA and protein data, however, have two
different characteristics. First, the alphabet of DNA
data is made up of four letters, and that of protein data
is made up of twenty letters. Second, DNA and protein
sequences usually have lengths of hundreds or thousands.
Accordingly, traditional approaches of mining sequential
patterns have difficulty coping with the small alphabets
and long sequences of biological data and are inefficient
for mining biological sequences. Projection-based
pattern growth algorithms [3,4] are used to process long
sequences in traditional sequential pattern mining, but
they require an extensive running time because they
need to construct and scan the corresponding projected
databases numerous times to generate long sequential
patterns. Another type of algorithm, the apriori-based
methods [1,2], is frequently used in traditional
sequential pattern mining, but these methods also have a
long processing time. Moreover, traditional approaches
are suited to a larger number of items and short
sequences, such as supermarket transactions; accordingly,
they cope inefficiently with biological data.

A novel method, the Depth-First Spelling algorithm for
mining sequential patterns (motifs) with Gap constraints
in biological sequences (termed DFSG [25]), is devised
in this work. This paper is rewritten from the
proceedings version of our article and mainly adds
explanations of gap constraints, of various sequential
pattern mining approaches, of projection-based pattern
growth algorithms, of the counting matrix technique for
the gap constraints, of GenPrefixSpan, and of how we
obtained our real data, together with related works and
references for biological sequences and more summaries
of this work. DFSG is a generalized approach that
incorporates gap constraints. A gap constraint is a limit
on the distance between two separate letters of a
sequence. The gap constraint suits data with few letters
and long sequences, such as biological sequences, and it
allows the unrelated sections of biological evolution to
be skipped. The user can assign the maximal distance
allowed between separate letters. We devise the DFSG
approach to depart from traditional methods of mining
sequential patterns and to tackle the problem of long
runtime. DFSG needs less runtime to discover motifs in
biological data than traditional approaches of mining
sequential patterns. The DFSG approach was evaluated in
a large number of experiments. First, DFSG and
GenPrefixSpan [26] were applied to real and simulated DNA
data. Afterward, the execution times of the two
approaches were compared by using increased values of gap
constraints, synthetic protein data, and varied
parameters in synthetic biological data. In the
experimental results, the runtime of the DFSG approach is
superior to that of GenPrefixSpan on biological data, and
DFSG is more scalable.

Several reasons account for the efficient runtime and
scalability of the DFSG method. We compare the DFSG
approach with GenPrefixSpan, which is a projection-based
pattern growth method. Unlike traditional
projection-based pattern growth methods, DFSG does not
need to build the corresponding projected databases;
accordingly, DFSG does not need to scan them either.
DFSG thus saves the recursive runtime of projection and
scanning. Figure 1 partially exhibits the executing
steps of GenPrefixSpan in an example. GenPrefixSpan must
first scan all sequences to generate the frequent items
{"a","t","c","g"} in the biological sequences
{X: atacgat, Y: atcacga, Z: taacgca} with the minimum
support λ equal to 3 and the gap constraint equal to 3
(Figure 1). Then, the projected databases for "a", "t",
"c", and "g" are generated individually, as {tacgat,
cgat, t, tcacga, cga, acgca, cgca}, {acgat, cacga,
aacgca}, {gat, acga, ga, gca, a}, and {at, a, ca},
respectively. After this projection, GenPrefixSpan scans
all the sequences in the projected database of "a" to
generate the frequent items {"a","c","g"}, and then the
projected databases for "aa," "ac," and "ag" are
generated individually. The GenPrefixSpan approach
projects the corresponding databases recursively until it
cannot generate any frequent letters.
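
The first projection step of this example can be reproduced with a minimal Python sketch. It is an illustration under stated assumptions rather than the GenPrefixSpan implementation: the function name `projected_database` is hypothetical, empty suffixes are dropped (as in the listing above), and the sketch does not model how the gap constraint is applied inside the suffixes.

```python
def projected_database(sequences, item):
    """Collect the non-empty suffix that follows every occurrence of `item`."""
    projected = []
    for seq in sequences:
        for pos, letter in enumerate(seq):
            suffix = seq[pos + 1:]
            if letter == item and suffix:
                projected.append(suffix)
    return projected

database = ["atacgat", "atcacga", "taacgca"]  # sequences X, Y, Z
for item in "atcg":
    print(item, projected_database(database, item))
```

Every occurrence of a frequent item produces its own suffix, which is why the projected databases grow quickly on long sequences over a small alphabet.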

Methods 

DFSG algorithm 

This section introduces the Depth-First Spelling
algorithm for Gapped sequential pattern mining of
biological sequences (referred to as DFSG). DFSG is
designed to efficiently mine sequential patterns of
biological sequences with gap constraints. The gap
constraints are critical and have numerous applications
in bioinformatics. A counting matrix Q is proposed to
cope with gap constraints; it records every position of
the latest item of a gapped sequential pattern in each
sequence. The latest item positions in the counting
matrix Q must satisfy the gap constraint. Every position
of the latest item is recorded in Q since each position
may extend to the next sequential pattern with the gap
constraint. If the counting matrix recorded only one
position of the latest item in each sequence, the other
positions would miss their chance to contribute support
counts to the next sequential pattern with the gap
constraint. If this caused the support count of the next
gapped sequential pattern to fall below the minimum
support count, the next gapped sequential pattern would
not be generated, and the subsequent gapped sequential
patterns would not be discovered either.

All positions of the latest item for a sequence in Q
together contribute only one support count to the next
gapped sequential pattern, since a sequence can
contribute only one support count to a pattern. If each
position of the latest item for a sequence in Q
contributed one support count to the next gapped
sequential pattern, the support count of the pattern
would be inflated, and wrong patterns might be reported.
By definition, a sequence cannot contribute multiple
support counts to a sequential pattern.

Figure 1 A partial example of the GenPrefixSpan process

In the following, we introduce the execution of the DFSG
approach. DFSG performs two procedures. First, the
three-dimensional indices are built by scanning the
provided data set once. Second, DFSG-Generation produces
gapped sequential patterns for motifs, as shown in
Figure 2. The DFSG-Generation operation includes the
spelling of candidate-gapped patterns and the
verification of gapped sequential patterns. The
verification procedure uses direct access and binary
search with the three-dimensional indices. In each
motif-producing procedure, the succeeding appearance
point of an item depends on the prefix of the item.
Therefore, we designed the counting matrix and the
three-dimensional indices for the DFSG-Generation
operation. The counting matrix cannot store the
succeeding appearance points in advance since there are
too many unknown feasible points for the succeeding
appearance.

The DFSG-Generation Algorithm

Input: The gap constraint G, the support threshold λ,
       and the three-dimensional indices P.
Output: The sequential patterns with the gap constraint

Initialize count = 0;
for S = 1 to N do   /* N is the number of sequences. */
    Initialize count_flag = false;
    for I = 1 to I_N do   /* I_N is the number of indices in each sequence. */
        Initialize search_flag = false;
        Initialize first = 1;
        Initialize last = the length of P(W, S);   /* W is the letter. */
        if C(S, I) has a value then   /* C is the counting matrix (Q in the text). */
            while first ≤ last
                m = (first + last) / 2;
                if C(S, I) less than P(W, S, m) then
                    near = P(W, S, m);
                    last = m - 1;
                    search_flag = true;
                else
                    first = m + 1;
                end if
            end while
        end if
        if search_flag equals true and near < C(S, I) + G then
            C(S, I) = near;
            count_flag = true;
        else
            C(S, I) = null;
        end if
    end for
    if count_flag equals true then
        count = count + 1;
    end if
    if count ≥ λ then
        break;
    end if
end for
if count ≥ λ then
    for each letter W do
        call DFSG-Generation(W, C, P);
    end for
end if

Figure 2 The DFSG-Generation Algorithm.
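
The verification step of Figure 2 can be sketched in Python for a single sequence. This is a minimal illustration, not the authors' implementation: the function name `extend_row` and the list-based rows are hypothetical, `bisect_right` from the standard library stands in for the hand-written binary search, and the gap test (at most `gap` letters skipped between the two positions) is an assumption chosen to agree with the bounds quoted in the worked example of this paper.

```python
from bisect import bisect_right

def extend_row(prev_positions, letter_positions, gap):
    """Update one row of the counting matrix for the next letter.

    prev_positions: positions of the latest item in this sequence.
    letter_positions: sorted positions of the next letter in this sequence.
    """
    new_row = []
    for prev in prev_positions:
        k = bisect_right(letter_positions, prev)  # first position greater than prev
        if k == len(letter_positions):
            continue  # the letter never occurs after prev
        near = letter_positions[k]
        # Assumed gap test: no more than `gap` letters between prev and near.
        if near - prev - 1 <= gap and near not in new_row:
            new_row.append(near)
    return new_row

# Sequence Y = ATCACGA: "C" at positions 3 and 5, "A" at positions 1, 4 and 7.
print(extend_row([3, 5], [1, 4, 7], gap=3))
```

Because the positions of each letter are stored in increasing order, one binary search per recorded position is enough, and no projected database has to be built or scanned.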
Definition 1. The set of all items is E, which equals
{e_1, e_2, ..., e_A}. It simulates DNA sequences when A
equals four and protein sequences when A equals twenty.
Let a sequence s be an ordered list of items. We denote
s = <s_1 s_2 s_3 ... s_n>, where s_i is an item.
Biological sequences usually have long lengths, and an
identical item can occur many times in a sequence.

Definition 2. We denote a sequence
u = <u_1 u_2 u_3 ... u_q>, where u_i is an item. A
sequence s contains u if <u_1 u_2 u_3 ... u_q> is
sequentially mapped to <s_1 s_2 s_3 ... s_n> (q ≤ n). In
this case, u is a subsequence of s.

Definition 3. The support count of a pattern α is the
number of sequences in the database that contain the
pattern α. If the support of pattern α is greater than
or equal to the minimum support, this pattern is called
a gapped motif. In general, the problem of mining gapped
motifs is not confined to any category of biological
sequences.

Definition 4. We denote a motif p = <p_1 p_2 p_3 ... p_m>,
where p_m is an item. If a sequence s can contribute
support to the motif p, this motif is a subsequence of
the sequence s. The item s_i of the sequence is mapped
by p_j, and the item s_k of the sequence is mapped by
p_{j+1}. If the s_k position is less than the s_i
position plus the gap constraint value, the motif p
conforms to the gap constraint G.

Definition 5. The three-dimensional indices are the
position number W_k, the sequence number E_j, and the
item number A_i. The position number W_k is the position
at which the item number A_i appears in the sequence
number E_j of the biological database.

Definition 6. The counting matrix Q has multiple position
numbers in each sequence number E_j for the procedure of
generating gapped motifs. Let α = <t_1 t_2 ... t_{n-1}>
be a gapped motif in the database, and let
β = <t_1 t_2 ... t_{n-1} t_n> be a sequence with the
prefix α. The multiple position numbers form the counting
matrix Q of the suffix β. The position number W_k is
determined by the item <t_n> in each sequence number E_j
of the three-dimensional indices. The position number W_k
of the suffix β must be greater than the counting matrix
Q of the prefix α. The updated position numbers of the
present letter in the sequences conform to the gap
constraint G for the gapped motif, and they are recorded
by the counting matrix Q.

Definition 7. The support of the suffix β is the number
of rows that have at least one value in the counting
matrix Q of β. If the support γ of β is greater than or
equal to the minimum support, the sequence β is certainly
a gapped motif.
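
Definitions 5 and 7 can be made concrete with a small Python sketch. The data layout is an assumption made for illustration (a dictionary from item to one list of 1-based positions per sequence), and the names `build_indices` and `support` are hypothetical rather than taken from the DFSG implementation.

```python
from collections import defaultdict

def build_indices(sequences):
    """Scan the database once: item -> sequence number -> sorted 1-based positions."""
    indices = defaultdict(lambda: [[] for _ in sequences])
    for seq_no, seq in enumerate(sequences):
        for pos, item in enumerate(seq, start=1):
            indices[item][seq_no].append(pos)
    return dict(indices)

def support(counting_matrix):
    """Definition 7: the number of rows that hold at least one position."""
    return sum(1 for row in counting_matrix if row)

database = ["ATACGAT", "ATCACGA", "TAACGCA"]  # sequences X, Y, Z
indices = build_indices(database)
print(indices["C"], support(indices["C"]))
```

For a single letter, the counting matrix is simply that letter's slice of the indices, so its support is the number of sequences in which the letter occurs.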

An example of DFSG 

The following example demonstrates the DFSG approach.
The biological sequence database D {X: ATACGAT, Y:
ATCACGA, Z: TAACGCA} over the set of items {A, T, C, G}
is mined by using the DFSG algorithm. For this example,
the minimum support λ is equal to 3, and the gap
constraint is also equal to 3. The procedures are
performed as follows. In the first procedure, the DFSG
approach constructs the three-dimensional indices by
reading the biological sequence database once. The DFSG
approach reads each biological sequence S and puts the
position W_k of each read item A_i into the
three-dimensional indices. In the second procedure, the
DFSG-Generation operation discovers the gapped motifs of
the biological sequences, as shown in Figure 3. The DFSG
approach spells items to generate candidate-gapped motifs
α for biological sequences in a depth-first manner. The
support counts of candidate-gapped motifs are verified by
using the counting matrix Q and the three-dimensional
indices. If the support count of a candidate-gapped
motif satisfies the minimum support count, the motif is a
gapped motif of the biological sequences, and the
recursive execution of the procedure continues.

The counting matrix Q of "C" is {(4), (3,5), (4,6)} since
"C" occurs in the position (4) of sequence X, the
positions (3,5) of sequence Y, and the positions (4,6) of
sequence Z. In the initial stage, all the positions of
"C" are in the counting matrix Q of "C" and satisfy the
gap constraint. As shown in Figure 3, the
candidate-gapped motif "C*A" is spelt by the DFSG
approach in a depth-first manner. DFSG searches the
positions in dimension "A" of the three-dimensional
indices to detect the minimum positions that are greater
than the positions in the counting matrix and satisfy the
gap constraint. The current counting matrix {(6), (4,7),
(7)} is greater than the former counting matrix {(4),
(3,5), (4,6)}, and all the positions in the new Q satisfy
the gap constraint 3 since the position (6) is less than
(9), the positions (4,7) are less than (8,10), and the
position (7) is less than (9).

The candidate-gapped motif "C*A" is certainly a gapped
motif since the support count 3 satisfies the minimum
support count. A support is regarded as satisfying a
minimum support when the support is greater than or equal
to the minimum support. The support of the gapped pattern
"C*A" is 3 since the position (6) of sequence X, the
positions (4,7) of sequence Y, and the position (7) of
sequence Z each contribute one support count to the
gapped pattern. DFSG continues to spell and verify
candidate motifs in a depth-first manner. Then, we
observe another candidate motif "A*T." The positions in
the "T" dimension of the indices are searched. The
support count is 2 since the updated counting matrix is
{(2,7), (2), (-)}. Therefore, the candidate-gapped motif
"A*T" is certainly not a gapped motif, and the subsequent
candidate-gapped motifs of this failed candidate "A*T"
are not generated.
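
The whole example can be traced with a compact Python sketch of depth-first spelling. It is a simplified illustration under stated assumptions, not the authors' C++ program: all function names are hypothetical, the gap test allows at most `gap` letters between consecutive matched positions (an assumption consistent with the bounds quoted above), and the sketch uses no early exit so that every row of the counting matrix is updated before recursion.

```python
from bisect import bisect_right

def build_indices(sequences, alphabet):
    """Letter -> one sorted list of 1-based positions per sequence."""
    return {w: [[p for p, x in enumerate(s, 1) if x == w] for s in sequences]
            for w in alphabet}

def extend(matrix, letter_rows, gap):
    """Build the counting matrix of the pattern extended by one letter."""
    new_matrix = []
    for prev_row, positions in zip(matrix, letter_rows):
        row = []
        for prev in prev_row:
            k = bisect_right(positions, prev)  # nearest later occurrence
            if k < len(positions) and positions[k] - prev - 1 <= gap:
                if positions[k] not in row:
                    row.append(positions[k])
        new_matrix.append(row)
    return new_matrix

def dfsg(sequences, alphabet, min_support, gap):
    """Return {gapped motif: support count} found by depth-first spelling."""
    indices = build_indices(sequences, alphabet)
    motifs = {}

    def grow(pattern, matrix):
        count = sum(1 for row in matrix if row)  # one count per sequence
        if count < min_support:
            return  # a failed candidate is not extended further
        motifs[pattern] = count
        for w in alphabet:
            grow(pattern + "*" + w, extend(matrix, indices[w], gap))

    for w in alphabet:
        grow(w, indices[w])
    return motifs

motifs = dfsg(["ATACGAT", "ATCACGA", "TAACGCA"], "ATCG", min_support=3, gap=3)
print(motifs.get("C*A"), "A*T" in motifs)
```

On this database the sketch reports "C*A" with support 3 and prunes "A*T", in line with the walk-through above.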
Results and discussion

Design of experiments 

The performance of DFSG was evaluated in a number of
experiments. In the first part of the experiments, we
compared the performance of DFSG with that of
GenPrefixSpan on synthetic and real DNA data.
GenPrefixSpan is a generalization of PrefixSpan, which
uses the projected database approach to recursively
construct sequential patterns and is an efficient
algorithm in traditional sequential pattern mining.
GenPrefixSpan stores all subsequences of each frequent
item occurrence in projected databases to cope with gap
constraints. We acquired real DNA data from the National
Center for Biotechnology Information (NCBI), which is a
national resource funded by the U.S. government. In the
second part of the experiments, we tested DFSG and
GenPrefixSpan with varied gap constraints, numbers of
sequences, lengths of sequences, and simulated protein
sequences. The scalability of the DFSG algorithm was
tested, too. All experiments were conducted on a
3.20 GHz Pentium(R) 4 PC with 1 GB of RAM running
Microsoft Windows XP Professional (2002). To make fair
comparisons, the two programs were written in the same
environment, Microsoft Visual C++ 6.0.



Synthetic and real DNA data 

DFSG and GenPrefixSpan were evaluated by using real DNA
data acquired from NCBI. The variables used in the
experiments are the length of a sequence L, the number of
letters A, the value of the gap constraint G, the minimum
support S, and the number of sequences N. Users can
access numerous public databases of molecular biology
from the NCBI website. As an example, we describe how we
obtained our real data (A = 4, L = 35, and N = 1000).
First, we access the NCBI website,
http://www.ncbi.nlm.nih.gov. Second, the nucleotide
database is selected. Third, the query is "sequence AND
35:35[Sequence Length]". Fourth, the first one thousand
sequences are crawled and parsed to form our data set.

The value of A is four for synthetic and real DNA data
in the experiments of DNA data. Additionally, the values
of L are twenty-five, thirty, and thirty-five in the
experiments. In Figures 4a-c, DFSG is superior to
GenPrefixSpan for real DNA data. In these experiments,
the values of the gap constraint are ten, seven, and
five, and the number of sequences is one thousand. In
the figures, the runtime of the two algorithms is shown
on the vertical axis, and the minimum support is shown on
the horizontal axis. The runtime rate is the runtime of
GenPrefixSpan divided by the runtime of DFSG. The runtime
rates are 8.68, 11.27, 14.77, 20.94, and 30.18 for real
DNA data, as shown in Figure 4c. The rate grows when the
minimum support gets larger. This means that the
advantage of DFSG over GenPrefixSpan increases for high
support thresholds in mining biological sequences.

Figure 4 Comparison of execution time based on real DNA sequences. a Execution Time (L = 25, A = 4, G = 10, and N = 1000).
b Execution Time (L = 30, A = 4, G = 7, and N = 1000). c Execution Time (L = 35, A = 4, G = 5, and N = 1000).

The simulated DNA data used in the subsequent experiments
were generated following reference [1]. The experimental
results of the synthetic DNA data show that DFSG is
superior to GenPrefixSpan, as shown in Figures 5a-c. We
simulated the letters for DNA data in the experiments.
The number of letters is four; the lengths of the
sequences are twenty-five, thirty, and thirty-five; the
value of the gap constraint is five; and the number of
sequences is three thousand. According to Figure 5c, the
runtime rates are 14.04, 22.88, 28.31, 41.41, and 68.93
for synthetic DNA data. The rate grows invariably when
the minimum support becomes larger. The performance trend
of DFSG on the real DNA data is the same as that on the
simulated DNA data in the above experiments. This
confirms that DFSG preserves its efficiency on real
biological data and that simulated sequences can
correctly validate the performance of DFSG.

Figure 5 Comparison of the execution time based on synthetic DNA sequences. a Execution Time (L = 25, A = 4, G = 5, and N = 3000).
b Execution Time (L = 30, A = 4, G = 5, and N = 3000). c Execution Time (L = 35, A = 4, G = 5, and N = 3000).

Gap constraints, simulated protein sequences, number of 
sequences, length of sequences, and scalability 

The performance of DFSG and that of GenPrefixSpan are
compared by varying the gap constraint, the number of
sequences, and the length of sequences, and by using
simulated protein sequences. The scalability of the DFSG
algorithm is also tested. In the experiments on gap
constraints, we raise the value of the gap constraint G
while the number of letters A stays at four for the
simulated DNA data. DFSG is superior to GenPrefixSpan
with variable gap constraints according to Figure 6a-d.
In the experiments, the lengths of the sequences are
forty and fifty, and the numbers of sequences are one
thousand and three thousand. The execution times of both
DFSG and GenPrefixSpan rise when we increase the value of
the gap constraint G, since the probability of finding
subsequent frequent items is enhanced and the number of
sequential patterns is increased.

DFSG is superior to GenPrefixSpan as N is raised,
according to Figure 7. In the experiments, the numbers
of sequences are four thousand, five thousand, six
thousand, seven thousand, eight thousand, and nine
thousand; the number of letters for the synthetic DNA
data is four; the length of the sequences is thirty; the
minimum support is 0.9; and the value of the gap
constraint is three. Additionally, DFSG is more scalable
than GenPrefixSpan, although GenPrefixSpan is scalable
[26]. The runtimes of DFSG appear as a nearly straight
line in Figure 7 as a result of the proportional scale.
The DFSG runtimes in Figure 7 are 2.781 s (four thousand
sequences), 2.890 s (five thousand sequences), 2.937 s
(six thousand sequences), 3.015 s (seven thousand
sequences), 3.125 s (eight thousand sequences), and
3.187 s (nine thousand sequences). The variations of the
DFSG runtimes are not obvious in Figure 7. Furthermore,
the runtime rates are 37.12, 55.38, 77.24, 102.32,
128.77, and 160.72. The runtime rate rises when the
number of sequences gets larger. This experiment confirms
that the advantage of DFSG over GenPrefixSpan grows as
the number of sequences is increased.
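
Because the runtime rate is defined as the GenPrefixSpan runtime divided by the DFSG runtime, the figures quoted above imply approximate GenPrefixSpan runtimes. The short Python calculation below derives them under the assumption that the rates are listed in the same order as the DFSG runtimes; the implied values are not reported in the text and inherit the rounding of the published numbers.

```python
sizes = [4000, 5000, 6000, 7000, 8000, 9000]               # number of sequences N
dfsg_runtimes = [2.781, 2.890, 2.937, 3.015, 3.125, 3.187]  # seconds, as reported
rates = [37.12, 55.38, 77.24, 102.32, 128.77, 160.72]       # GenPrefixSpan / DFSG

# Implied GenPrefixSpan runtime = rate * DFSG runtime (derived, not measured).
implied = [round(t * r, 1) for t, r in zip(dfsg_runtimes, rates)]
dfsg_growth = dfsg_runtimes[-1] / dfsg_runtimes[0]

for n, t, g in zip(sizes, dfsg_runtimes, implied):
    print(f"N={n}: DFSG {t:.3f} s, GenPrefixSpan about {g} s")
print(f"DFSG runtime grows by a factor of {dfsg_growth:.2f} from N=4000 to N=9000")
```

The calculation makes the contrast explicit: the DFSG runtime changes little over this range, while the implied GenPrefixSpan runtime grows several times over.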

Figure 6 Comparison of the execution time based on synthetic DNA sequences for the effect of length of gaps (horizontal axis: gap constraint).
a Execution Time (L = 40, A = 4, and N = 1000). b Execution Time (L = 40, A = 4, and N = 3000).
c Execution Time (L = 50, A = 4, and N = 1000). d Execution Time (L = 50, A = 4, and N = 3000).

Figure 7 Comparison of the execution time based on synthetic DNA sequences for the effect of number of sequences. Execution Time
(L = 30, A = 4, G = 3, and S = 0.9).

According to Figure 8, DFSG outperforms GenPrefixSpan as
L is increased. The lengths of the sequences are
forty-five, forty-six, forty-seven, forty-eight,
forty-nine, and fifty; the number of letters is four; the
minimum support is 0.9; the value of the gap constraint
is five; and the number of sequences is one thousand in
the experiments. Compared with GenPrefixSpan, the DFSG
runtime in Figure 8 rises steadily as the sequence length
L increases. Because of the proportional scale, the DFSG
runtimes in this figure would look even more like a
straight line if the maximum sequence length were greater
than fifty. Additionally, the results of the experiment
show that DFSG significantly outperforms GenPrefixSpan as
L increases. In the following, we use a different
alphabet size to test the execution time of DFSG and that
of GenPrefixSpan. According to Figure 9, DFSG mines much
faster than GenPrefixSpan when the number of letters A
equals twenty for synthetic protein data. The length of
the sequences is one hundred; the number of sequences is
five hundred; and the value of the gap constraint is
twenty in the experiment on synthetic protein data.

Figure 8 Comparison of the execution time based on synthetic DNA sequences for the effect of length of sequences. Execution Time
(A = 4, N = 1000, G = 5, and S = 0.9).

Figure 9 Comparison of execution time based on simulated protein sequences. Execution Time (A = 20, L = 100, G = 20, and N = 500).

To test the scalability of DFSG, the number of sequences
N is increased from one hundred thousand to five hundred
thousand. In the experiment, the runtimes of DFSG are
36.218, 73.640, 108.390, 152.109, and 206.640 s for one
hundred thousand, two hundred thousand, three hundred
thousand, four hundred thousand, and five hundred
thousand sequences, respectively. In the experimental
results, the execution time of DFSG is scalable when the
number of sequences gets larger, and the growth rate of
the DFSG runtime is steady. This experiment confirms that
DFSG is scalable for large biological data. In the
experiment, the number of letters for the synthetic DNA
data is four; the minimum support is 0.9; the value of
the gap constraint is three; and the length of the
sequences is thirty. Taken together, the experiments
show that DFSG is superior to GenPrefixSpan in various
settings, including synthetic DNA/protein data, real DNA
data, length of sequences, number of sequences, and gap
constraints.

Conclusions 

Mining sequential patterns of biological sequences is
important in computational biology. However, traditional
sequential pattern mining methods have difficulty coping
with biological sequences, whose lengths are long and
whose alphabets are small. Furthermore, gap constraints
for motif discovery are also important in computational
biology. Therefore, DFSG is proposed to efficiently mine
motifs of biological sequences with gap constraints.
DFSG can help biologists discover all-length motifs with
gap constraints, and when mining biological sequences,
DFSG is more efficient than GenPrefixSpan. In our future
work, we will devise efficient or effective algorithms to
help mine biological sequences.